A representation of the genetic code as a six-dimensional Boolean hypercube is described. This
structure is the result of the hierarchical order of the interaction energies of the bases in
codon-anticodon recognition. In this paper it is applied to study molecular evolution in vivo and
in vitro. In the first case we compared aligned positions in homologous protein sequences and found
two different behaviors: a) there are sites in which the different amino acids may be explained by
one or two "attractor nodes" (coding for the dominating amino acid(s)) and their one-bit neighbors
in the codon hypercube, and b) there are sites in which the amino acids correspond to codons located
in closed paths in the hypercube. In the second case we studied the "Sexual PCR"^1 experiment
described by Stemmer [?] and found that the success of this combination of usual PCR and
recombination is in part due to the Gray code structure of the genetic code.

I. INTRODUCTION

The genetic code is the biochemical system for gene expression. It deals with the
translation, or decoding, of information contained in the primary structure of DNA
and RNA molecules into protein sequences. Therefore the genetic code is both a
physico-chemical and a communication system. Physically, molecular recognition
depends on the degree of complementarity between the interacting molecular surfaces
(by means of weak interactions); informationally, a prerequisite to define a code is
the concept of distinguishability. It is the physical indistinguishability of some
codon-anticodon interaction energies that makes the codons synonymous, and the code
degenerate and redundant [?].

In natural languages [?] as well as in the genetic code the total redundancy is due to
a hierarchy of constraints acting one upon another. The specific way in which the code
departs from randomness is, by definition, its structure. It is assumed that this
structure is the result of the hierarchical order of the interaction energies of the
bases in codon-anticodon recognition. The hypercube structure of the genetic code as
recently introduced [?] will be described and its implications for molecular evolution
and test-tube evolution experiments will be discussed. As we shall see, the genetic code
may be represented by a six-dimensional Boolean hypercube in which the codons (actually
the code-words; see below) occupy the vertices (nodes) in such a way that all kinship^2
neighborhoods are correctly represented. This approach is a particular application to
binary sequences of length six of the general concept of sequence-space, first
introduced in coding theory by Hamming.

A code-word is next to six nodes representing codons differing in a single property.
Thus the hypercube simultaneously represents the whole set of codons and keeps track of
which codons are one-bit neighbors of each other. Different hyperplanes correspond to
the four stages of the evolution of the code according to the Co-evolution Theory [?].
Transitions within three of the "columns" (four-dimensional cubes), consisting of the
codon classes NGN, NAN, NCN, and NUN, lead to silent and conservative amino acid
substitutions, while transitions in the same hyperplane (four-dimensional subspace
belonging to any of the codon classes ANN, CNN, GNN or UNN) lead to non-conservative
substitutions as frequently found in proteins. The proposed structure shows that in the
genetic code there is a good balance between conservatism and innovation. To illustrate
these results several examples of the non-conservative variable positions of homologous
proteins are discussed. Two different behaviors were found:

i There are sites in which the different amino acids may be explained by one or two
"attractor nodes" (coding for the dominating amino acid(s)) and their one-bit
neighbors in the codon hypercube, and

ii There are sites in which the amino acids correspond to codons located in closed
paths in the hypercube.

^1 Polymerase Chain Reaction plus DNA shuffling.
^2 The term kinship means the relationship between members of the same family.

Very recently the rapid evolution of a protein in vitro by DNA shuffling has been
accomplished by Stemmer [?].

This experiment, called "Sexual PCR" by Smith, was further discussed in [9]. Smith
recalls that Stemmer investigated the beta-lactamase gene TEM-1, which has a very low
activity against the antibiotic cefotaxime^3. After three cycles of mutagenesis,
recombination and selection he found the minimum inhibitory concentration to be
16,000 times higher than that of the original clone.

It will be shown that, without exception, the amino acid replacements in TEM-1 mutants
selected for high resistance to cefotaxime may be accounted for by one-bit changes of
the corresponding codons. This shows that the structure of the code permits a very
significant change in the function of the coded protein by means of one-bit changes of
some of the codons, provided that these mutations are integrated in a single
polynucleotide by recombination.



II. CODON-ANTICODON INTERACTION 

The four bases occurring in DNA (RNA) macromolecules define the corresponding
alphabet $\mathcal{A}$: {A, C, G, T} or $\mathcal{A}$: {A, C, G, U}. Each base is
completely specified by two independent dichotomic categorizations (Fig. 1):

i according to its chemical type, C: {R, Y}, where R: {A, G} are purines and
Y: {C, U} are pyrimidines, and

ii according to H-bonding, H: {W, S}, where W: {A, U} are weak and S: {C, G} are
strong bases.

FIG. 1. Categorizations of the bases. The categorizations of the bases according to
(i) chemical type, C: {R, Y}, where R: {A, G} are purines and Y: {C, U} are
pyrimidines, and (ii) H-bonding, H: {W, S}, where W: {A, U} are weak and S: {C, G}
are strong bases. The third possible partition into imino/keto bases is not
independent from the former ones and is irrelevant for the codon-anticodon
interaction. The binary representation of the bases is also shown. The first bit is
the chemical type and the second one the H-bonding character. $\alpha$, $\beta$ and
$\gamma$ are the transformations of the bases which form a Klein-4 group [8,11].

The third possible partition into imino/keto bases is not independent from the former
ones. Denoting by $C_i$ the chemical type and by $H_i$ the H-bond category of the base
$B_i$ at position $i$ of a codon, our basic assumption says that the codon-anticodon
interaction energy obeys the following hierarchical order:

$$C_2 > H_2 > C_1 > H_1 > C_3 > H_3 .$$

This means that the most important characteristic determining the codon-anticodon
interaction is the chemical type of the base in the second position; the next most
important characteristic is whether there is a weak or strong base in this position;
then the chemical type of the first base, and so on.

The above assumption goes beyond the early qualitative view that the optimization
between stability and rate, which is always found for enzyme-substrate interactions,
also applies to the codon-anticodon interaction [?]. Besides, several authors have
suggested that three bases are needed for effectively binding the adapter to the
messenger. From this it may be inferred that the codon size determines a range of
overall codon-anticodon interaction strength within which recognition can occur. The
genetic translation rate is limited, among other things, by codon-anticodon
recognition which, in turn, depends on base-pair lifetimes in a given structural
situation. These lifetimes are influenced by the nature of the pairs: they are shorter
for A-T than for G-C pairs [?].

^3 The minimum inhibitory concentration for Escherichia coli bacteria carrying a
TEM-1-bearing plasmid is only 20 ng ml^-1.

The bases are represented by the nodes of a 2-cube (Fig. 1). The first attribute is
the chemical character and the second one is the hydrogen-bond character. Extending
this association to base triplets, each codon is associated in a unique way with a
code-word consisting of six attribute values (see Table I).

TABLE I. Gray code representation of the genetic code. In the first and fourth blocks
the six-dimensional vectors (code-words) are shown. In the second and fifth blocks
appear the corresponding codons. Finally, in the third and sixth columns the amino
acids are given in single letter notation. The first two digits correspond to the
first base, the following two to the second base and the last two to the last base,
according to the binary codification of the bases of Fig. 1.

In some of the hypercube directions single-feature codon changes (one-bit code-word
changes) produce synonymous or conservative amino acid substitutions in the
corresponding protein (when the transitions occur in three of the 4-cubes displayed as
"columns" in Figs. 2 and 4), while in other directions they lead to context-dependent
replacements which, in general, conserve only certain physical properties. However, if
these properties are the only relevant ones in the given context, the substitution has
little effect on the protein structure as well. These low-constraint sites facilitate
evolution because they allow the transit between hypercube columns belonging to amino
acids with very different physico-chemical properties (e.g. hydrophobic and
hydrophilic amino acids, respectively).
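
The association between codons and code-words, and the notion of a one-bit change, can be sketched in a few lines of Python. The numerical bit values used here (R = 0, Y = 1 for the chemical type and W = 0, S = 1 for the H-bond character, so that A = 00, G = 01, U = 10, C = 11) are an assumption made for illustration, since Fig. 1 and Table I fix the actual assignment; the function names are likewise hypothetical.

```python
# Assumed two-bit attributes per base: (chemical type, H-bond character).
# R (purine) = 0, Y (pyrimidine) = 1; W (weak) = 0, S (strong) = 1.
BASE_BITS = {"A": "00", "G": "01", "U": "10", "C": "11"}
BITS_BASE = {bits: base for base, bits in BASE_BITS.items()}


def codon_to_codeword(codon):
    """Concatenate the attribute bits of the three bases: C1 H1 C2 H2 C3 H3."""
    return "".join(BASE_BITS[base] for base in codon)


def codeword_to_codon(word):
    """Inverse mapping: read the six bits two at a time."""
    return "".join(BITS_BASE[word[k:k + 2]] for k in (0, 2, 4))


def one_bit_neighbors(codon):
    """The six codons whose code-words differ from the given one in one bit."""
    word = codon_to_codeword(codon)
    neighbors = []
    for k in range(6):
        flipped_bit = "1" if word[k] == "0" else "0"
        neighbors.append(codeword_to_codon(word[:k] + flipped_bit + word[k + 1:]))
    return neighbors
```

Under this assumed assignment, `one_bit_neighbors("CAC")` returns GAC, UAC, CUC, CGC, CAG and CAU. The set of neighbors does not depend on which attribute value is labeled 0 or 1, only on the fact that each bit encodes one of the two categorizations.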
FIG. 3. Each of the fat short dashed lines represents 8 edges, connecting the
corresponding nodes of two three-dimensional cubes. The figure shows a
four-dimensional cube using the symbolic fat drawn link (top) and the same cube using
the standard representation.
FIG. 4. The hypercube representation of the genetic code. Each node represents a
code-word (six-dimensional vector) of attribute values. However, for clarity of
interpretation, the nodes are labeled with the corresponding codons (see Table I for
the assignment of codons to vectors). The nodes and links mentioned in the second
example discussed in the text are shown. The edges connect: AGG ↔ AGC, AGC ↔ ACC,
AGC ↔ AAC, UCC ↔ ACC, ACC ↔ GCC, GCC ↔ CCC, CGC ↔ CAC, CAC ↔ GAC, CAC ↔ CUC,
CAC ↔ CAG.



FIG. 2. The six-dimensional hypercube. Each node is labeled with the corresponding
amino acid in the single letter notation or terminator symbol. The fat short dashed
lines represent a complex connection between two (three-dimensional) cubes. Each such
line represents 8 edges, connecting the corresponding nodes of two neighboring
three-dimensional cubes (see Fig. 3). The cluster of amino acids of the first example
discussed in the text is displayed by fat points at the corresponding nodes and dashed
thin curved lines for the edges.




III. GRAY CODE STRUCTURE OF THE 
GENETIC CODE 

An n-dimensional hypercube, denoted by $Q_n$, consists of $2^n$ nodes, each addressed
by a unique n-bit identification number. A link exists between two nodes of $Q_n$ if
and only if their node addresses differ in exactly one bit position. A link is said to
be along dimension i if it connects two nodes whose addresses differ in the ith bit
(where the least significant bit is referred to as the 0th bit). $Q_4$ is illustrated
in Fig. 3. Two nodes in a hypercube are said to be adjacent if there is a link between
them. The (Hamming) distance between any two cube nodes is the number of bits
differing in their addresses. The minimum number of transitions needed to reach a node
from another node equals the distance between the two nodes. A d-dimensional sub-cube
in $Q_n$ involves $2^d$ nodes whose addresses belong to a sequence of n symbols
{0, 1, *} in which exactly d of them are the symbol * (i.e. the don't care symbol,
whose value can be 0 or 1).
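
These definitions translate directly into code. The following sketch, with hypothetical helper names, represents node addresses as bit strings, computes the Hamming distance, and expands a sub-cube pattern over {0, 1, *} into its nodes.

```python
from itertools import product


def hamming(a, b):
    """Number of bit positions in which two equal-length addresses differ."""
    return sum(x != y for x, y in zip(a, b))


def adjacent(a, b):
    """Two nodes are linked iff their addresses differ in exactly one bit."""
    return hamming(a, b) == 1


def link_dimension(a, b):
    """Dimension of the link between two adjacent nodes.

    Bit 0 is the least significant bit, i.e. the rightmost character.
    """
    if not adjacent(a, b):
        raise ValueError("nodes are not adjacent")
    index = next(k for k, (x, y) in enumerate(zip(a, b)) if x != y)
    return len(a) - 1 - index


def subcube(pattern):
    """All node addresses matching a pattern over the symbols 0, 1 and *."""
    free = pattern.count("*")
    nodes = []
    for fill in product("01", repeat=free):
        values = iter(fill)
        nodes.append("".join(next(values) if ch == "*" else ch for ch in pattern))
    return nodes
```

For example, `subcube("******")` lists the 64 nodes of the six-dimensional hypercube, and a pattern with three stars such as `"0*1**0"` gives a three-dimensional sub-cube with 8 nodes.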

The idea of proposing a Gray code representation of the genetic code goes back to
Swanson [12], where this concept is explained in detail (see also [13]). However, a
great number of different Gray codes can be associated with the genetic code,
depending on the order of importance of the bits in a code-word. In Table I our chosen
Gray code is displayed. It is constructed according to our main hypothesis

$$C_2 > H_2 > C_1 > H_1 > C_3 > H_3 .$$

For example, the first two lines of the table differ in the last bit, corresponding to
$H_3$, which is the least significant bit; the second and the third lines differ in
the next least significant bit, i.e. $C_3$, and so forth.

IV. THE STRUCTURE OF CODON DOUBLETS 

This section is more mathematical than the rest of the 
paper. It is not essential for the understanding of the 
rest of the paper. 

In a pioneering paper Danckwerts and Neubert [?] discussed the symmetries of the
sixteen $B_1B_2$ codon doublets in terms of the Klein-4 group of base transformations.
Here their result will be recast in the form of a decision-tree (Fig. 5) and their
analysis will be extended to the $B_2B_3$ doublets. They found the following structure
for the set $M$ of $B_1B_2$ doublets:

Starting from AC, generate the sets

$$M_0 = \{[(1,1) \cup (\alpha,1) \cup (\alpha,\beta) \cup (\alpha,\gamma)]\,\mathrm{AC}\} = \{\mathrm{AC}, \mathrm{CC}, \mathrm{CG}, \mathrm{CU}\},$$

$$M_1 = [(1,1) \cup (\beta,1)]\, M_0 ,$$

$$M_2 = (\alpha,\alpha)\, M_1 .$$

The sets $M_1$ and $M_2$ consist of four-fold and less than four-fold degenerate
doublets, respectively.

FIG. 5. Decision-tree of codon categories and redundancy distribution. The leaves are
the sets of four-fold ($M_1$) and less than four-fold ($M_2$) degenerate $B_1B_2$
doublets.

The set $M$ can be expressed as:

$$M = [(1,1) \cup (\beta,1)]\, [(1,1) \cup (\alpha,\alpha)]\, M_0 ,$$

where the base exchange operators $\alpha$, $\beta$, $\gamma$ are defined in Fig. 1.
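
The construction of these sets can be checked mechanically. In the sketch below the three base exchanges are written as permutations inferred from the stated result $M_0$ = {AC, CC, CG, CU}; this reading is an assumption, and Fig. 1 remains the authoritative definition. The helper names are hypothetical.

```python
# Base exchange operators, inferred from M0 = {AC, CC, CG, CU} (assumption).
IDENT = {base: base for base in "ACGU"}
ALPHA = {"A": "C", "C": "A", "G": "U", "U": "G"}  # changes both attributes
BETA = {"A": "U", "U": "A", "C": "G", "G": "C"}   # changes the chemical type
GAMMA = {"A": "G", "G": "A", "C": "U", "U": "C"}  # changes the H-bond character


def act(op, doublets):
    """Apply a pair of base operators to every doublet of a set."""
    first, second = op
    return {first[d[0]] + second[d[1]] for d in doublets}


def union_of(ops, doublets):
    """Union of the images of a set under several operator pairs."""
    result = set()
    for op in ops:
        result |= act(op, doublets)
    return result


M0 = union_of([(IDENT, IDENT), (ALPHA, IDENT), (ALPHA, BETA), (ALPHA, GAMMA)], {"AC"})
M1 = union_of([(IDENT, IDENT), (BETA, IDENT)], M0)
M2 = act((ALPHA, ALPHA), M1)
M = union_of([(IDENT, IDENT), (BETA, IDENT)],
             union_of([(IDENT, IDENT), (ALPHA, ALPHA)], M0))
```

With these permutations $M_1$ comes out as the eight doublets of the four-fold degenerate codon families (AC, CC, CG, CU, UC, GC, GG, GU), $M_2$ as the remaining eight, and $M$ as all sixteen doublets.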

They showed that: "a) $M_1$ and $M_2$ are invariant by operating with $(\beta, 1)$ on
$B_1$, but no operation on $B_2$ leaves $M_1$ or $M_2$ invariant. Thus $B_2$ carries
more information than $B_1$ and $B_2$ is therefore more important for the stability of
$M_1$ and $M_2$ than $B_1$ ... A change of $B_1$ with respect to its hydrogen bond
property does not change the resulting amino acids if all doublets of either $M_1$ or
$M_2$ are affected.

Reversing supposition and conclusion, $M_1$ and $M_2$ may be defined as those doublet
sets of 8 elements which are invariant under the $(\beta, 1)$-transformation. Then
experience shows that $M_1$ and $M_2$ are fourfold and less than fourfold degenerate
respectively."

Thus the third base degeneracy of a codon does not depend on the exact base $B_1$, but
only on its H-bond property (weak or strong).

The above results can be simply visualized as a decision-tree (Fig. 5). It can be seen
from this figure that the redundancy of a codon is determined only by the H-bond
character of $B_1$ and $B_2$: SSN codons (with 6 H-bonds in $B_1B_2$) belong to $M_1$
while WWN codons (with 4 H-bonds in $B_1B_2$) belong to $M_2$. However, for codons WSN
and SWN (with 5 H-bonds in $B_1B_2$) it is not possible to decide unless one has more
information about the second base: WCN and SUN belong to $M_1$ while WGN and SAN
belong to $M_2$. In all cases at most three attributes are necessary to determine the
redundancy of a codon up to this point. Of course the non-degenerate codons (AUG for
methionine and UGG for tryptophan) will require the specification of the six
attributes.
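
The decision rules read off from the tree can be written as a short function. This is a sketch with hypothetical names: it uses the H-bond character of $B_1$ and $B_2$ and, only when these differ, the chemical type of $B_2$ (a pyrimidine second base leads to $M_1$, a purine to $M_2$), so that at most three attributes are inspected.

```python
STRONG = set("CG")       # bases forming three H-bonds
PYRIMIDINE = set("CU")   # chemical type Y


def doublet_class(b1, b2):
    """Classify a B1B2 doublet as 'M1' (four-fold) or 'M2' (less than four-fold)."""
    strong1, strong2 = b1 in STRONG, b2 in STRONG
    if strong1 and strong2:          # SSN: 6 H-bonds
        return "M1"
    if not strong1 and not strong2:  # WWN: 4 H-bonds
        return "M2"
    # WSN or SWN (5 H-bonds): the chemical type of the second base decides.
    return "M1" if b2 in PYRIMIDINE else "M2"


M1 = {a + b for a in "ACGU" for b in "ACGU" if doublet_class(a, b) == "M1"}
M2 = {a + b for a in "ACGU" for b in "ACGU" if doublet_class(a, b) == "M2"}
```

The resulting sets contain eight doublets each; for instance AC and CU fall in `M1`, while AG and CA fall in `M2`.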

From the decision rules obtained from Fig. 5 it is clear that there are branches where
the refinement procedure cannot continue (the branches which end in $M_1$) because, no
matter which base occupies the third codon position, the degeneracy cannot be lifted.
This imposes a limit on the maximum number of amino acids which can be incorporated
into the code without resorting to a "frozen accident" hypothesis. Our proposal
generalizes the "2-out-of-3" hypothesis of Lagerkvist [?], which refers only to codons
in the SSN class.

The sixteen $B_1B_2$ doublets can be represented as the vertices of a four-dimensional
hypercube. Figure 6 shows that the sets $M_1$ and $M_2$ are located in compact
regions. Notice that this figure differs from the one introduced by Bertman and
Jungck [16], who considered as basic transformations $\alpha$ and $\beta$, instead of
$\beta$ and $\gamma$ as we did. Since the operator $\alpha$ changes two bits we do not
consider it as basic.



FIG. 6. The four-dimensional hypercube representation of the sets $M_1$ (dotted) and
$M_2$ (fat).

FIG. 7. The hypercube corresponding to Fig. 6 for the sets $M'_1$ and $M'_2$ (see
text). Notice that in both cases each set is located in a compact region.

Let us consider now the structure of the set $M'$ of $B_2B_3$ doublets. Exactly as
before, define the sets

$$M'_0 = \{\mathrm{NC}\},$$

$$M'_1 = [(1,1) \cup (1,\beta)]\, M'_0 ,$$

$$M'_2 = (\alpha,\alpha)\, M'_1 \quad (\text{alternatively } M'_1 = (\alpha,\alpha)\, M'_2),$$

where $M'_1$ consists of the doublets $B_2B_3$ ending in a strong base (NS) and $M'_2$
of the doublets ending in a weak base (NW). Then

$$M' = M'_1 \cup M'_2$$

can be expressed as

$$M' = [(1,1) \cup (1,\beta)]\, [(1,1) \cup (\alpha,\alpha)]\, M'_0 .$$

Notice that the operator acting on $M'_0$ has the same functional form as the operator
acting on $M_0$ above, except that $\beta$ acts on the third base instead of the
first.

The sets $M'_1$ and $M'_2$ are invariant under the $(1, \beta)$-transformations. Then
experience shows that the 32 codons in the class $NB_2B_3$, with $B_2B_3$ in $M'_1$ or
$M'_2$, constitute a complete code codifying for the 20 amino acids and the terminator
signal (stop codon), if allowance is made for the deviating codon assignments found in
mitochondria [?]. For the codons in $M'_1$ this is true in the universal code; for the
codons in $M'_2$, AUA should codify for M instead of I and UGA for W instead of the
stop signal. Both changes have been observed in mitochondria. This more symmetric code
has been considered more similar to an archetypal code than the universal code [?].
Only after the last attribute $H_3$ was introduced was the universal code obtained,
with the split of AUR into AUA (I) and AUG (M) and of UGR into UGG (W) and UGA (t).
It has been speculated that primordial genes could be included in a 0.55 kb open
reading frame [18]. The same authors calculated that with two stop codons such open
reading frames would have appeared too frequently. From the present view the
assignment of UGA to a stop codon was a late event that optimized this frequency (this
interpretation differs from the one proposed in [18,19], where a primordial code with
three stop codons is assumed). Other deviations from the universal code most likely
also occurred in the last stages of the code's evolution.

In the same way as before, the sixteen $B_2B_3$ doublets can be represented as the
vertices of a four-dimensional hypercube (Fig. 7). The sets $M'_1$ and $M'_2$ are also
located in compact regions. Codons with $B_2B_3$ in $M'_1$ are frequently used in
eukaryotes. In contrast, codons with $B_2B_3$ in $M'_2$ are frequently used in
prokaryotes. The described structure of the code allows a modulation of the
codon-anticodon interaction energy [20].



V. EXAMPLES 



Besides the results mentioned in the last section, which refer to codon doublets, we
are going to consider several examples of molecular evolution to further illustrate
the significance of the proposed approach.

The first example (Fig. 2) refers to the alignment studied using the method of
hierarchical analysis of residue conservation by Livingstone and Barton (Fig. 2 in
[21]). In position 11 the following amino acids appear: R, W, H, G, D, which according
to their approach have no properties in common. In Fig. 2 this cluster of amino acids
is shown. By looking at the Atlas of amino acid properties we see that, from the
properties proposed by Grantham [23] (composition, polarity and volume), apparently
the only requirement for the amino acids at this site is to maintain a certain degree
of polarity. From this observation we may conclude that most probably it is an
external site. Simply by looking at such a diverse set of amino acids one can hardly
realize that they have clustered codons. This clustering facilitates the occurrence of
mutations that in the course of evolution were fixed, in view of the low
physico-chemical requirements at the site.

As a second example (Fig. 4) let us consider site 33 of the alignment of 67 SH2
domains, Fig. 6 of [?]. We can see from Fig. 4 that the cluster around the codon CAC
(H) explains, by one-bit changes, the amino acids R, Q, L, H, D. Furthermore, a second
cluster around the codon AGC (S) explains the amino acids R, N, S, T. Finally, a
silent change from AGC (S) to UCC (S) accounts for the minor appearance of the small,
neutral amino acids A, T, P. In a similar way the variation of the hyper-variable
region of the immunoglobulin kappa light chain FR1 at position 18 can be explained
(Fig. 8). The number after the amino acid symbol in Fig. 8 is the number of times the
amino acid occurs in the alignment in [24].

FIG. 8. The amino acid hypercube with the amino acids at position 18 of the variable
region of the kappa light chain displayed. The number after the amino acid symbol is
the number of times the amino acid occurs in the alignment in [24].
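
The two clusters of the second example can be reproduced with a small script. In this sketch a one-bit change of a base is taken to be either the exchange A↔U, C↔G (chemical type) or the exchange A↔G, C↔U (H-bond character), which is our reading of the operators $\beta$ and $\gamma$ and should be treated as an assumption; the codon table is the standard universal code and the function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# One-bit changes of a single base (assumed reading of beta and gamma).
BETA = {"A": "U", "U": "A", "C": "G", "G": "C"}
GAMMA = {"A": "G", "G": "A", "C": "U", "U": "C"}


def one_bit_neighbors(codon):
    """The six codons obtained by applying beta or gamma at one position."""
    return [codon[:pos] + op[codon[pos]] + codon[pos + 1:]
            for pos in range(3) for op in (BETA, GAMMA)]


def cluster_amino_acids(center):
    """Amino acids coded by a codon and by its one-bit neighbors."""
    return {CODE[c] for c in [center] + one_bit_neighbors(center)}


observed_around_cac = {"R", "Q", "L", "H", "D"}
observed_around_agc = {"R", "N", "S", "T"}
```

Both observed sets are contained in the corresponding clusters: the cluster around CAC additionally contains Y, and the one around AGC additionally contains C and G, which are neighbors in the hypercube that are not listed for this site.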

As a third example consider the residue frequencies in 226 globins displayed in
Table 3 of the paper by Bashford et al. [?]. From this table we find that there are
variable positions in which one or two residues predominantly occur and the rest are
only marginally represented, and others in which the frequencies are more evenly
distributed among the amino acids. As can easily be shown, the first class of
positions may be associated, at the codon level, with one (or two) attractor node(s)
and its one-bit neighbors. The second one can be associated with closed trajectories
in the hypercube. The corresponding figures are not included because of lack of space.



Finally, let us discuss the "Sexual PCR" experiment. In the paper by Smith [9] a table
is displayed showing the positions in the TEM-1 gene where mutations occur, together
with the substitutions found in the variant genes ST-1, ST-2 and ST-4, which show
increased resistance to cefotaxime. We refer to the mentioned paper for further
details. Locating these mutations in the hypercube (Fig. ?) one can easily convince
oneself that all mutations may be accounted for by one-bit changes at the codon level.
Therefore only six codons (four or five amino acids) are searched in each mutation and
not 19 alternatives. This finding helps to explain why this in vitro realization of a
"genetic algorithm" was so successful.

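The size of the search space per mutation can be tabulated for every codon. The sketch below counts, for each sense codon of the universal code, how many different amino acids are reachable by a one-bit change; as before, one-bit changes of a base are assumed to be the exchanges A↔U, C↔G and A↔G, C↔U, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# Assumed one-bit changes of a base: chemical type or H-bond character.
ONE_BIT = ({"A": "U", "U": "A", "C": "G", "G": "C"},
           {"A": "G", "G": "A", "C": "U", "U": "C"})


def one_bit_neighbors(codon):
    """The six codons differing from the given one in a single attribute."""
    return [codon[:pos] + op[codon[pos]] + codon[pos + 1:]
            for pos in range(3) for op in ONE_BIT]


def alternatives(codon):
    """Amino acids other than the original one reachable by one-bit changes."""
    return {CODE[n] for n in one_bit_neighbors(codon)} - {CODE[codon], "*"}


def alternatives_histogram():
    """How many sense codons have 0, 1, 2, ... alternative amino acids."""
    sense = [codon for codon, aa in CODE.items() if aa != "*"]
    return Counter(len(alternatives(codon)) for codon in sense)
```

For the codon CAC, for example, `alternatives("CAC")` contains five amino acids (D, Y, L, R and Q), compared with the 19 alternatives available if any replacement were equally accessible.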
It is well known in the field of genetic algorithms that a proper encoding is crucial
to the success of an algorithm. Furthermore, in [26] the superiority of Gray coding
over binary coding for the performance of a genetic algorithm is shown. As was shown
above, the structure of the genetic code is precisely the structure of a Gray code.
Therefore it is our claim that this is one of the reasons why very efficient variants
were found after very few rounds of recombination. Most probably other reasons are
that the initial population was not random, but consisted of selected sequences, and
that these sequences were very similar among themselves. This explanation of the
results of Stemmer's experiment differs from the explanation advanced by Smith [9].
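
One property often invoked to explain the advantage of Gray coding can be illustrated in a few lines. The sketch below is not taken from [26]; it only compares, for six-bit words, how many bits must change to pass from one integer to the next in plain binary and in the reflected Gray code. The function names are hypothetical.

```python
def gray(i):
    """Reflected binary Gray code of a non-negative integer."""
    return i ^ (i >> 1)


def binary(i):
    """Plain binary coding (identity on integers)."""
    return i


def bit_distance(a, b):
    """Hamming distance between the binary representations of two integers."""
    return bin(a ^ b).count("1")


def max_step(encode, n_bits):
    """Largest number of bit changes between consecutive encoded values."""
    return max(bit_distance(encode(i), encode(i + 1))
               for i in range(2 ** n_bits - 1))
```

In plain binary the step from 31 (011111) to 32 (100000) requires all six bits to change, whereas in the Gray code every step between consecutive values is a single-bit change, so that small changes of the encoded value are always reachable by a single mutation.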



VI. CONCLUDING REMARKS 

The present approach goes beyond the usual analyses in terms of single base changes,
because it takes into account the two characters of each base and therefore it
represents one-bit changes. Besides, the base position within the codon is also
considered. The fact that single-bit mutations occur frequently is expected from
probabilistic arguments. However, one could not expect, a priori, that a cluster of
mutations would correspond, at the amino acid level, to a cluster of amino acids fixed
by natural selection. We have found that this situation presents itself for many
positions of homologous protein sequences of many different families (results not
included). The structure of the code facilitates evolution: the variation found at the
variable positions of proteins does not correspond to random jumps at the codon level,
but to well defined regions of the hypercube. Finally, the Gray code structure of the
genetic code helps to explain the success of "Sexual PCR" experiments.